# IS215 - Reduce Gym Membership Churn Dataset
This notebook shows how Anytime Fitness can identify the features that contribute to customer churn, as part of an IS215 Digital Business Transformation and Technology project.
The purpose of proposing digital transformation for Anytime Fitness is to reduce its membership churn.
Using an example customer-churn dataset provided by Model Fitness, Anytime Fitness can similarly identify the factors that contribute to customer churn and formulate strategies to retain those customers, thereby decreasing membership churn.
### Setup

**If using conda**

- Download the is215env.yaml file.
- From the base environment, run `conda env create -n is215 --file is215env.yaml` to create a conda environment named is215 from is215env.yaml.
- Select Kernel > conda env:is215 from the menu bar in Jupyter Notebook to activate the newly created is215 conda environment.

**Else, if using pip**

- Run `pip install <packagename>` for each of the packages listed in the dependencies below.

### Data dictionary

- 'Churn' — the fact of churn for the month in question
- 'gender'
- 'Near_Location' — whether the user lives or works in the neighborhood where the gym is located
- 'Partner' — whether the user is an employee of a partner company (the gym has partner companies whose employees get discounts; in those cases the gym stores information on customers' employers)
- 'Promo_friends' — whether the user originally signed up through a "bring a friend" offer (they used a friend's promo code when paying for their first membership)
- 'Phone' — whether the user provided their phone number
- 'Age'
- 'Lifetime' — the time (in months) since the customer first came to the gym
- 'Contract_period' — 1 month, 3 months, 6 months, or 1 year
- 'Month_to_end_contract' — the months remaining until the contract expires
- 'Group_visits' — whether the user takes part in group sessions
- 'Avg_class_frequency_total' — average frequency of visits per week over the customer's lifetime
- 'Avg_class_frequency_current_month' — average frequency of visits per week over the preceding month
- 'Avg_additional_charges_total' — the total amount of money spent on other gym services: cafe, athletic goods, cosmetics, massages, etc.

import os
os.environ["OMP_NUM_THREADS"] = '5' # set to avoid a memory leak warning when using the k-means clustering model
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, ConfusionMatrixDisplay
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.cluster import KMeans
# Visualisation libraries
import matplotlib.pyplot as plt
import plotly.express as px
df = pd.read_csv("./gym_churn_us.csv")
df.sample(5)
| | gender | Near_Location | Partner | Promo_friends | Phone | Contract_period | Group_visits | Age | Avg_additional_charges_total | Month_to_end_contract | Lifetime | Avg_class_frequency_total | Avg_class_frequency_current_month | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2334 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 33 | 369.234321 | 1.0 | 3 | 1.674872 | 1.593047 | 0 |
| 2806 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 25 | 227.239408 | 1.0 | 3 | 3.060350 | 3.118923 | 0 |
| 875 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 29 | 49.099452 | 1.0 | 6 | 2.631843 | 2.656269 | 0 |
| 3698 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 33 | 60.202737 | 1.0 | 1 | 1.559161 | 1.520917 | 0 |
| 927 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 35 | 114.347092 | 1.0 | 7 | 3.568346 | 3.548644 | 0 |
df.info() # no null values
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4000 entries, 0 to 3999
Data columns (total 14 columns):
 #   Column                             Non-Null Count  Dtype
---  ------                             --------------  -----
 0   gender                             4000 non-null   int64
 1   Near_Location                      4000 non-null   int64
 2   Partner                            4000 non-null   int64
 3   Promo_friends                      4000 non-null   int64
 4   Phone                              4000 non-null   int64
 5   Contract_period                    4000 non-null   int64
 6   Group_visits                       4000 non-null   int64
 7   Age                                4000 non-null   int64
 8   Avg_additional_charges_total       4000 non-null   float64
 9   Month_to_end_contract              4000 non-null   float64
 10  Lifetime                           4000 non-null   int64
 11  Avg_class_frequency_total          4000 non-null   float64
 12  Avg_class_frequency_current_month  4000 non-null   float64
 13  Churn                              4000 non-null   int64
dtypes: float64(4), int64(10)
memory usage: 437.6 KB
# Check if there are any rows where months to end contract > contract period
df[df['Month_to_end_contract'] > df['Contract_period']]
| gender | Near_Location | Partner | Promo_friends | Phone | Contract_period | Group_visits | Age | Avg_additional_charges_total | Month_to_end_contract | Lifetime | Avg_class_frequency_total | Avg_class_frequency_current_month | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 rows × 14 columns
# Rename some columns
df = df.rename(columns={
'Month_to_end_contract': 'Num_months_to_contract_end',
'Avg_class_frequency_total': 'Avg_weekly_visits_total', # average frequency of visits per week over the customer's lifetime
'Avg_class_frequency_current_month': 'Avg_weekly_visits_preceding_month'# average frequency of visits per week over the preceding month
})
df.describe(include='all')
| | gender | Near_Location | Partner | Promo_friends | Phone | Contract_period | Group_visits | Age | Avg_additional_charges_total | Num_months_to_contract_end | Lifetime | Avg_weekly_visits_total | Avg_weekly_visits_preceding_month | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 |
| mean | 0.510250 | 0.845250 | 0.486750 | 0.308500 | 0.903500 | 4.681250 | 0.412250 | 29.184250 | 146.943728 | 4.322750 | 3.724750 | 1.879020 | 1.767052 | 0.265250 |
| std | 0.499957 | 0.361711 | 0.499887 | 0.461932 | 0.295313 | 4.549706 | 0.492301 | 3.258367 | 96.355602 | 4.191297 | 3.749267 | 0.972245 | 1.052906 | 0.441521 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 18.000000 | 0.148205 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 27.000000 | 68.868830 | 1.000000 | 1.000000 | 1.180875 | 0.963003 | 0.000000 |
| 50% | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 0.000000 | 29.000000 | 136.220159 | 1.000000 | 3.000000 | 1.832768 | 1.719574 | 0.000000 |
| 75% | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 6.000000 | 1.000000 | 31.000000 | 210.949625 | 6.000000 | 5.000000 | 2.536078 | 2.510336 | 1.000000 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 12.000000 | 1.000000 | 41.000000 | 552.590740 | 12.000000 | 31.000000 | 6.023668 | 6.146783 | 1.000000 |
df['Contract_period'].value_counts()
1     2207
12     960
6      833
Name: Contract_period, dtype: int64
X = df.drop('Churn', axis=1)
y = df['Churn']
def preprocess_data(X):
# for numerical columns
# ---------------------
# we normalize using MinMaxScaler to constrain values between 0 and 1
scaler = MinMaxScaler(feature_range = (0,1))
df_numeric = X.select_dtypes(include=[np.number])
numeric_cols = df_numeric.columns.values
for col in numeric_cols:
# fit_transform() of scaler can be applied to each column individually
X[col] = scaler.fit_transform(X[[col]])
return X
X = preprocess_data(X)
X.describe()
| | gender | Near_Location | Partner | Promo_friends | Phone | Contract_period | Group_visits | Age | Avg_additional_charges_total | Num_months_to_contract_end | Lifetime | Avg_weekly_visits_total | Avg_weekly_visits_preceding_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 | 4000.000000 |
| mean | 0.510250 | 0.845250 | 0.486750 | 0.308500 | 0.903500 | 0.334659 | 0.412250 | 0.486272 | 0.265721 | 0.302068 | 0.120153 | 0.311939 | 0.287476 |
| std | 0.499957 | 0.361711 | 0.499887 | 0.461932 | 0.295313 | 0.413610 | 0.492301 | 0.141668 | 0.174417 | 0.381027 | 0.120944 | 0.161404 | 0.171294 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.391304 | 0.124394 | 0.000000 | 0.032258 | 0.196039 | 0.156668 |
| 50% | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.478261 | 0.246310 | 0.000000 | 0.096774 | 0.304261 | 0.279752 |
| 75% | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.454545 | 1.000000 | 0.565217 | 0.381581 | 0.454545 | 0.161290 | 0.421019 | 0.408398 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
# Split features and labels into train and test sets - 80-20
# random_state is desirable for reproducibility
# stratify - to keep the same proportion of (yes and no) as input dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 10, stratify = y)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
(3200, 13) (800, 13) (3200,) (800,)
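As a side check, the effect of `stratify` can be demonstrated on synthetic data. Everything below (`X_toy`, `y_toy`, the ~30% positive rate) is made up for illustration and does not reuse the notebook's variables:

```python
# Sketch: stratified splitting preserves the class proportions of the input.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X_toy = rng.rand(1000, 3)                    # 1000 synthetic samples, 3 features
y_toy = (rng.rand(1000) < 0.3).astype(int)   # roughly 30% positives, like the ~27% churn rate

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10, stratify=y_toy)

# With stratify, the positive rate in train and test matches the full set
# almost exactly (counts differ by at most one sample per class).
print(y_toy.mean(), y_tr.mean(), y_te.mean())
```

Without `stratify`, the test-set class balance would drift with the random draw, which matters for an imbalanced target like churn.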
# create a logistic regression classifier
# ---------------------------------------
logreg = LogisticRegression(solver='liblinear', random_state=7)
logreg.fit(X_train, y_train)
# Apply the model on the test data to get a prediction
# ----------------------------------------------------
y_pred = logreg.predict(X_test)
y_pred.shape
(800,)
n = 10
print(f"Probability of first {n} users not churning: \n {logreg.predict_proba(X_test)[:,0][:n]}", end='\n\n') # probability of not churning
print(f"Probability of first {n} users churning: \n {logreg.predict_proba(X_test)[:,1][:n]}", end='\n\n') # probability of churning
print(f"Whether first {n} users churned", y_pred[:n]) # using default threshold of 0.5 to decide if user churn (probability > 0.5) or not (probability < 0.5)
Probability of first 10 users not churning: 
 [0.80407839 0.89458534 0.3542922 0.97464291 0.09258459 0.85284717 0.73054158 0.82518119 0.0756298 0.01948577]

Probability of first 10 users churning: 
 [0.19592161 0.10541466 0.6457078 0.02535709 0.90741541 0.14715283 0.26945842 0.17481881 0.9243702 0.98051423]

Whether first 10 users churned [0 0 1 0 1 0 0 0 1 1]
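If catching more churners matters more than avoiding false alarms, the default 0.5 cut-off can be lowered by thresholding `predict_proba` directly instead of calling `predict`. A minimal sketch on synthetic data (the classifier, data, and the 0.3 threshold are all illustrative, not from the notebook):

```python
# Sketch: trading precision for recall by lowering the decision threshold.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(42)
X_toy = rng.randn(500, 4)
y_toy = (X_toy[:, 0] + 0.5 * rng.randn(500) > 0).astype(int)

clf = LogisticRegression(solver='liblinear', random_state=7).fit(X_toy, y_toy)
proba_churn = clf.predict_proba(X_toy)[:, 1]      # column 1 = P(churn)

pred_default = (proba_churn >= 0.5).astype(int)   # the default behaviour of .predict()
pred_lenient = (proba_churn >= 0.3).astype(int)   # flags more users as at-risk

# A lower threshold can only add positive predictions, never remove them.
print(pred_default.sum(), pred_lenient.sum())
```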
# Get accuracy of the model
# -------------------------
print('Testing accuracy %s' % accuracy_score(y_test, y_pred))
# compute confusion matrix
# ------------------------
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
print(classification_report(y_test, y_pred))
Testing accuracy 0.9175
precision recall f1-score support
0 0.92 0.97 0.95 588
1 0.90 0.77 0.83 212
accuracy 0.92 800
macro avg 0.91 0.87 0.89 800
weighted avg 0.92 0.92 0.92 800
tn, fp, fn, tp = cm[0][0], cm[0][1], cm[1][0], cm[1][1]
# Precision: given the model predicts positive/negative, how good ('precise') is the model at predicting positive/negative
precision_no_churn = tn / (tn + fn)
print(f'precision_no_churn is {precision_no_churn:.3f}')
precision_churn = tp / (tp + fp)
print(f'precision_churn is {precision_churn:.3f}')
# Recall: given the actual value is positive/negative, how good is the model at predicting ('recalling') positive/negative
recall_no_churn = tn / (tn + fp)
print(f'recall_no_churn is {recall_no_churn:.3f}')
recall_churn = tp / (tp + fn)
print(f'recall_churn is {recall_churn:.3f}')
precision_no_churn is 0.922
precision_churn is 0.901
recall_no_churn is 0.969
recall_churn is 0.774
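The manual formulas above can be cross-checked against sklearn's `precision_score` and `recall_score`. A tiny hand-made example (the labels are synthetic, not from the gym dataset):

```python
# Cross-check: manual confusion-matrix formulas vs sklearn's metric helpers.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_churn = tp / (tp + fp)   # of predicted churners, how many actually churned
recall_churn = tp / (tp + fn)      # of actual churners, how many were caught

print(precision_churn, recall_churn)  # both 0.75 for this toy example
```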
Logistic regression is quite accurate in predicting customer churn (91.75% testing accuracy).

Precision:
- no churn (0): the model is correct 92% of the time
- churn (1): the model is correct 90% of the time

Recall:
- no churn (0): the model is correct 97% of the time
- churn (1): the model is correct 77% of the time

The model is weakest on churn recall: given that a customer actually churns, it identifies them only 77% of the time.
### Possible evaluation with another classification model (e.g. decision tree) if using the model to predict whether new customers will churn
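A hedged sketch of what that comparison could look like, on synthetic data (the notebook's train/test split is not reused; `max_depth=4` is an arbitrary illustrative choice):

```python
# Sketch: evaluating a decision tree the same way the logistic regression
# was evaluated, on synthetic stand-in data.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(0)
X_toy = rng.randn(600, 5)
y_toy = (X_toy[:, 0] - X_toy[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=10, stratify=y_toy)

# max_depth limits overfitting; tune it (e.g. via GridSearchCV) on real data
tree = DecisionTreeClassifier(max_depth=4, random_state=7).fit(X_tr, y_tr)
acc = accuracy_score(y_te, tree.predict(X_te))
print(f'Decision tree testing accuracy: {acc:.3f}')
```

On the real data, comparing this accuracy (and, more importantly, churn recall) against the logistic regression would indicate whether the extra flexibility of a tree helps.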
### Identify the n features that contribute most to churn

Come up with strategies that change these features to retain customers.
n = 6
rfe = RFE(logreg, n_features_to_select=n)
fit = rfe.fit(X, y)
print(f'Number features: {fit.n_features_}')
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
Number features: 6
Selected Features: [False False False False False  True False  True  True False  True  True  True]
Feature Ranking: [7 6 5 4 8 1 3 1 1 2 1 1 1]
selected = fit.support_
feature_names = X.columns[selected]
feature_names
Index(['Contract_period', 'Age', 'Avg_additional_charges_total', 'Lifetime',
'Avg_weekly_visits_total', 'Avg_weekly_visits_preceding_month'],
dtype='object')
for i in feature_names:
print(i)
Contract_period
Age
Avg_additional_charges_total
Lifetime
Avg_weekly_visits_total
Avg_weekly_visits_preceding_month
For customers who churned, group them into user profiles/clusters based on the most significant fields, then focus retention efforts on the particular groups identified.
churn_df = df[df['Churn'] == 1].copy()
# churn_df.describe()
churn_features_df = churn_df[feature_names].copy()
churn_features_df = preprocess_data(churn_features_df)
churn_features_df.describe()
| | Contract_period | Age | Avg_additional_charges_total | Lifetime | Avg_weekly_visits_total | Avg_weekly_visits_preceding_month |
|---|---|---|---|---|---|---|
| count | 1061.000000 | 1061.000000 | 1061.000000 | 1061.000000 | 1061.000000 | 1061.000000 |
| mean | 0.066233 | 0.449482 | 0.270189 | 0.110064 | 0.424014 | 0.295047 |
| std | 0.193852 | 0.144758 | 0.182649 | 0.123422 | 0.199706 | 0.217564 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 0.350000 | 0.118671 | 0.000000 | 0.290565 | 0.119013 |
| 50% | 0.000000 | 0.450000 | 0.243699 | 0.111111 | 0.428669 | 0.276658 |
| 75% | 0.000000 | 0.550000 | 0.388984 | 0.111111 | 0.562414 | 0.448716 |
| max | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 |
SSE = []
for i in range(1,6):
KMeans_model = KMeans(n_clusters=i, random_state=42, n_init=10, max_iter=300)
KMeans_model.fit(churn_features_df)
SSE.append(KMeans_model.inertia_)
# Plot the SSE to decide the n_clusters to use based on elbow plot
labels = {'x': 'No. of Clusters', 'y': 'SSE'}
title = 'Sum of Squared Errors (SSE) to Centroid Based on No. of Clusters'
fig = px.line(x=range(1,6), y=SSE, labels=labels, title=title)
fig.show()
Either 3 or 4 clusters can be chosen based on the elbow plot.
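As a complement to the elbow plot, silhouette scores offer a second opinion on the cluster count. A sketch on synthetic blob data standing in for `churn_features_df` (the blob parameters are illustrative):

```python
# Sketch: picking the cluster count by silhouette score instead of the elbow.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated synthetic blobs stand in for the churn feature matrix.
X_blob, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_blob)
    scores[k] = silhouette_score(X_blob, labels)  # higher is better, range [-1, 1]

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

Unlike SSE, which always decreases as k grows, the silhouette score peaks at a defensible cluster count, which can break the tie between 3 and 4.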
KMeans_model = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=42)
KMeans_model.fit(churn_features_df)
KMeans(n_clusters=3, n_init=10, random_state=42)
user_profile_df = pd.DataFrame(data=KMeans_model.cluster_centers_, columns=churn_features_df.columns)
user_profile_df
| | Contract_period | Age | Avg_additional_charges_total | Lifetime | Avg_weekly_visits_total | Avg_weekly_visits_preceding_month |
|---|---|---|---|---|---|---|
| 0 | 0.559969 | 0.433193 | 0.291236 | 0.120448 | 0.467211 | 0.320948 |
| 1 | 0.001055 | 0.454408 | 0.271714 | 0.100541 | 0.585992 | 0.480380 |
| 2 | 0.006227 | 0.449119 | 0.264000 | 0.115677 | 0.277335 | 0.132697 |
labels={'x': 'User Profile'}
heatmap = px.imshow(user_profile_df.T, x=user_profile_df.index, y=user_profile_df.columns,
labels=labels, color_continuous_scale='Viridis', title='Heatmap of Churn User Profiles')
heatmap.show()
cluster_labels = KMeans_model.labels_ # cluster label for each user that churned
cluster_labels
array([2, 1, 2, ..., 2, 2, 2])
churn_df['Cluster'] = cluster_labels
churn_df
| | gender | Near_Location | Partner | Promo_friends | Phone | Contract_period | Group_visits | Age | Avg_additional_charges_total | Num_months_to_contract_end | Lifetime | Avg_weekly_visits_total | Avg_weekly_visits_preceding_month | Churn | Cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 7 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 30 | 217.786641 | 1.0 | 0 | 1.213502 | 1.049387 | 1 | 2 |
| 11 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 27 | 266.783824 | 1.0 | 2 | 2.348535 | 1.584316 | 1 | 1 |
| 13 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 27 | 152.748668 | 1.0 | 1 | 0.000000 | 0.000000 | 1 | 2 |
| 21 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 30 | 153.198594 | 1.0 | 3 | 0.636382 | 0.000000 | 1 | 2 |
| 23 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 32 | 276.079882 | 1.0 | 1 | 1.420637 | 1.043101 | 1 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3982 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 24 | 135.395712 | 1.0 | 0 | 1.390251 | 0.789321 | 1 | 2 |
| 3984 | 1 | 1 | 0 | 0 | 1 | 6 | 1 | 22 | 101.107544 | 6.0 | 2 | 1.377860 | 0.343915 | 1 | 0 |
| 3986 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 34 | 226.804845 | 1.0 | 0 | 1.399529 | 0.304130 | 1 | 2 |
| 3994 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 26 | 31.066721 | 1.0 | 1 | 1.157451 | 0.269597 | 1 | 2 |
| 3996 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 29 | 68.883764 | 1.0 | 1 | 1.277168 | 0.292859 | 1 | 2 |
1061 rows × 15 columns
cluster_churn_count = churn_df['Cluster'].value_counts()
cluster_churn_count
2    512
1    430
0    119
Name: Cluster, dtype: int64
title="Share of churned customers in each cluster"
px.pie(values=cluster_churn_count, names=cluster_churn_count.index, title=title)
Based on the cluster groups, the gym (e.g. Anytime Fitness) should focus retention strategies on clusters 1 and 2 (regular and new customers with short contract periods), as they comprise the majority (>88%) of the customers who churned.
Use the identified features that affect churn, or other features, to find out reasons/patterns why users in clusters 1 and 2 churn.

Identified features that affect churn: Contract_period, Age, Avg_additional_charges_total, Lifetime, Avg_weekly_visits_total, Avg_weekly_visits_preceding_month
Questions to ask to identify potential reasons why customers churn
churn_df['Contract_utilisation_pct'] = (churn_df['Num_months_to_contract_end'] / churn_df['Contract_period']) * 100
churn_df[['Num_months_to_contract_end', 'Contract_period', 'Contract_utilisation_pct']].head()
| | Num_months_to_contract_end | Contract_period | Contract_utilisation_pct |
|---|---|---|---|
| 7 | 1.0 | 1 | 100.0 |
| 11 | 1.0 | 1 | 100.0 |
| 13 | 1.0 | 1 | 100.0 |
| 21 | 1.0 | 1 | 100.0 |
| 23 | 1.0 | 1 | 100.0 |
Question
Did the churned users from clusters 1 and 2 fully utilise their contract period?
fig = px.histogram(churn_df, x="Contract_utilisation_pct",
color='Cluster', barmode='group',
height=600, title="Contract Utilisation % for Churn users in each Cluster", )
fig.update_xaxes(title_text='Contract Utilisation %')
fig.show()
Finding
Most churned users from clusters 1 and 2 did fully utilise their contract period. This suggests these users were frequent patrons right up until their contract ended.
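The finding above can also be quantified per cluster with a simple groupby. A minimal sketch on a hand-made toy frame (the real `churn_df` is not reused here; the cluster labels and percentages are invented):

```python
# Sketch: share of churned users per cluster who fully utilised their contract.
import pandas as pd

toy = pd.DataFrame({
    'Cluster': [1, 1, 1, 2, 2, 2, 0, 0],
    'Contract_utilisation_pct': [100.0, 100.0, 50.0, 100.0, 100.0, 100.0, 50.0, 25.0],
})

# Mark full utilisation, then average the boolean flag within each cluster.
full_use_share = (toy.assign(full=toy['Contract_utilisation_pct'] == 100.0)
                     .groupby('Cluster')['full'].mean())
print(full_use_share)
```

Applied to the real `churn_df`, this would put a single number per cluster behind the histogram reading.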